Internet Info 1994 March

From: Andras Kornai <andras@calera.com>
Subject: Re: One more kermit question
To: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Date: Thu, 11 Mar 93 21:33:45 PST

----------------------------------------------------------------------

CYRILLIC ENCODING FAQ Version 1.3, March 13 1993

ACKNOWLEDGEMENTS

Most of the information was provided by the following:

David J. Birnbaum <djbpitt+@pitt.edu>
Frank da Cruz <fdc@watsun.cc.columbia.edu>
Bur Davis <bdavis@adobe.com>
George Fowler <gfowler@ucs.indiana.edu>
Richard B. Paine <RPAINE@CCNODE.Colorado.EDU>
Slava Paperno <PAPY@CORNELLA.cit.cornell.edu>
Keld J. Simonsen <Keld.Simonsen@dkuug.dk>
Glenn E. Thobe <thobe@getunx.info.com>
Dimitri Vulis <DLV@CUNYVMS1.BITNET>
Johan W. van Wingen <precal@rulmvs.leidenuniv.nl>

Thanks to all who contributed -- I am responsible for the errors that
still remain.

Andras Kornai (andras@calera.com, kornai@csli.stanford.edu)

Q: What are the commonly used computer encodings for Cyrillic?

A: Broadly speaking, there are three kinds of schemes in use: those
that replace Cyrillic characters by 7-bit ascii values, those that use
the full 8-bit range 0-255, and those using multi-byte codes. Presently
only the first two types are in wide use, but for reference purposes I
will also discuss the third type.

Q: What kind of transliteration schemes are there?

A: The most important one is called KOI-7: the Russian alphabet is
given by the ASCII characters (note the exchange of upper and lower
cases):

UPPER CASE: abwgde$vzijklmnoprstufhc~{}"yx|`q
lower case: ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\@Q

The following extensions to the official standard KOI-7 are supported
in Glenn Thobe's conversion programs for invertibility: '"'=YER,
'#'=yo, '$'=YO, '<'=left guillemet, '>'=right guillemet.

A slightly different (multicharacter) scheme is employed by Steve
Gaarder's (gaarder@theory.tc.cornell.edu) conversion code from Old
KOI-8, included below.
This particular scheme provides easy readability but suffers from some
transliteration weirdness, such as mapping short ii and yeri on the
same character. Since proper transliteration often requires
context-sensitive rules, and differs from language to language within
the same script, a fuller discussion is beyond the scope of the present
document. For an overview of the major Cyrillic to Latin
transliteration schemes used in the US, see pp 457-460 of the Style
Manual of the US Government Printing Office, for sale by the
Superintendent of Documents, USGPO, Washington DC 20402, Stock Number
021-000-00120-1 (paper) or 021-000-00120-0 (hardbound). See also the
Chicago Manual of Style, and Transliteracija russkikh slov latinskimi
bukvami, GOST 16876-71.

#include <stdio.h>

/* Latin transliterations for Old KOI-8 codes 0xc0-0xff:
   32 lower case entries first, then 32 upper case entries. */
char transtbl[64][5] = {
    "yu", "a", "b", "ts", "d", "e", "f", "g",
    "kh", "i", "y", "k", "l", "m", "n", "o",
    "p", "ya", "r", "s", "t", "u", "zh", "v",
    "'", "y", "z", "sh", "e", "shch", "ch", "`",
    "YU", "A", "B", "TS", "D", "E", "F", "G",
    "KH", "I", "Y", "K", "L", "M", "N", "O",
    "P", "YA", "R", "S", "T", "U", "ZH", "V",
    "'", "Y", "Z", "SH", "E", "SHCH", "CH", "`"
};

int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        /* Strip the high bit of 8-bit codes (0x80 included, so that
           the index below always stays inside transtbl). */
        if (c >= 0x80) c -= 0x80;
        /* Everything in 0x40-0x7f goes through the table, so plain
           ASCII letters are treated as KOI-8 with the high bit
           already stripped. */
        if (c < 0x40) putchar(c);
        else printf("%s", transtbl[c - 0x40]);
    }
    return 0;
}

Q: What are the eight-bit schemes?

A: For the IBM mainframe world, which includes the ES (edinaja
sistema) clones of 360-370 mainframes, the basic scheme, called DKOI-8,
extends EBCDIC by putting the Cyrillic letters in the unused slots,
mostly in the rectangle 0x8a to 0xff (first hex digit >=8, second digit
>=a). The mysteries of EBCDIC/ASCII conversion go beyond the scope of
this document, and in the table that follows I will ignore 8-bit ascii
values below 0xa0 and refer the reader to Dimitri Vulis' excellent
document, which sheds some light on the IBM meaning of the characters
0x80-0x9f which are reserved in both ISO 8859-1 (Latin-1) and 8859-5
(Cyrillic).

/* From 8859-5 to DKOI-8.
   ebcdic(isoval) = isotoibm[isoval-160] */
int isotoibm[96] = {
    0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
    0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
    0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
    0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
    0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
    0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
    0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
    0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
    0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
    0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
    0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
    0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
};

There are minor variations to DKOI, called Cyrillic Extended Code Page
037 (most common on BITNET), CECP 500 (which is the definitive one),
the "JNET" and the "FORTRAN" mappings. The differences between these
are tabulated below. Notice that EBCDIC/DKOI, unlike ASCII, is not
uniquely defined even on the 0-127 range:

8859-5  037   500   JNET  FORTRAN
0x21    0x5a  0x4f  0x5a  0x4f     exclamation point (bang)
0x5b    0xba  0x4a  0xad  0x4a     opening square bracket
0x5d    0xbb  0x5a  0xbd  0x5a     closing square bracket
0x5e    0xb0  0x5f  0x5f  0x5f     circumflex accent
0x7c    0x4f  0xbb  0x6a  0x4f     logical or (vertical bar)
[a2]    0x4a  0xb0  0x43  0x43     centsign (in 037)/capital dje (in 500)
[ac]    0x5f  0xba  0x54  0x54     logical not (in 037)/capital kje (in 500)
0xd5    0xef  0xef  0xbb  0xad     small ie
0xe3    0x46  0x46  0x4a  0xbb     small u
0xe5    0x47  0x47  0xfc  0xbd     small kha
0xfc    0xdc  0xdc  0x6a  0xfc     small kje

For the Internet, the most important code seems to be Old KOI-8, widely
used in the Relcom groups (but probably not a whole lot elsewhere). Old
KOI-8 (GOST 19768-74) from 1974 more or less follows Latin
transliteration order and does not include upper-case hard sign, or
letters common to other Slavic Cyrillic alphabets (Bulgarian,
Macedonian, Serbian, Ukrainian...). In the 0-127 range it is identical
with ascii, and for the 192-254 region see the transtbl array above.
Some programs, including uunpack (also used in Sergej Ryzhkov's bml,
aka Beauty Mail system for PCs) which is distributed by Relcom, force
upper-case hard sign to 255; others (and the standard!) declare this
incorrect, or perhaps reserve 255 for DEL. In an earlier version of
Andrew Hume's <andrew@research.att.com> tcs, which supports conversion
across a wide variety of Cyrillic encodings, this was called the
"mystery DOS Cyrillic encoding", except that his sha and shcha seem to
be interchanged. Tcs is available for anon ftp from research.att.com as
/dist/tcs.shar.Z. The semantics of 128-191 in Old KOI is unclear to me.
If there is an official code page (it was suggested that Xenix users
might have one), please post it.

For the PC community, Code Page 866 seems to be quite important. This
is what Microsoft is using in its russified version of MS-DOS. In 0-31
ascii control chars are replaced by a random selection of dingbats. In
32-126 it is identical to ascii, and in 127 it has something that looks
like a little house (the interpretation of such positions seems to be
subject to much uncertainty). The Russian part (128-255) is identical
to Brjabrin's alternativnyj variant, except for 242-251, where some of
the accents/symbols of AV are replaced by non-Russian Cyrillic
characters and other symbols. Unfortunately, beyond Russian, CP 866
covers only Ukrainian and Belorussian, with the vague suggestion that
e.g. Macedonian users could redefine the six non-Russian Cyrillic
positions.

This problem is largely resolved in Code Page 1251, the Microsoft
Cyrillic Windows 3.1 character set (also endorsed by WordPerfect and
Adobe), which contains all Cyrillic letters used by modern Slavic
languages. CP 1251 is fully compatible with ascii on 0-127 (leaves
control positions undefined), has the Russian alphabet (in order, but
without io) in 192-255, and puts the non-Russian Cyrillic, Russian io,
and a few symbols in 128-191.
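
Because CP 1251 keeps the Russian letters in one contiguous
alphabetical run, that part of the mapping can be computed instead of
looked up. The fragment below is only a sketch of this idea: the names
cp1251_russian_to_unicode and NOT_COVERED are made up for illustration,
the positions used for io (168 and 184) are read off the CP 1251 table
given later in this document, and the rest of 128-191 is deliberately
left out.

```c
/* Hypothetical marker: "this sketch has no answer for the byte".
   It is unrelated to the -1 and -2 markers used in the tables. */
#define NOT_COVERED (-3L)

/* Map a CP 1251 byte to its Unicode value, Russian letters only. */
long cp1251_russian_to_unicode(unsigned char b)
{
    if (b < 128)
        return b;                       /* ascii range is unchanged */
    if (b >= 192)
        return 0x0410L + (b - 192);     /* A at 192 ... small ja at 255 */
    if (b == 168)
        return 0x0401L;                 /* capital io */
    if (b == 184)
        return 0x0451L;                 /* small io */
    return NOT_COVERED;                 /* non-Russian letters, symbols */
}
```

A full converter would still need the table for 128-191; the arithmetic
only replaces the two alphabetical blocks.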
Brjabrin's Alternativnyj Variant (AV) is also widely used on PCs. It
has Russian in 128 to 175 in alphabetical order except for yo, graphics
characters in 176 to 223, and Russian again in 224-241. The same set of
graphics characters, but not in the same order, is used in Brjabrin's
Osnovnoj Variant: they are similar to, but not identical with, IBM
Extended ASCII graphics chars (neither the set of shapes nor the code
values are exactly the same). AV and OV have no non-Russian Cyrillic or
accented characters, but four accent marks are provided: 242 (acute
below the symbol), 243 (grave below the symbol), 244 (acute above the
symbol), and 245 (grave above the symbol). These, as well as upper case
and lower case yo, codes 240 and 241, are in the same position in
Osnovnoj Variant as well. Codes 246 - 249 are arrows, pointing right,
left, down, up, in that order. Codes 250 and 251 are, in both sets
described by Brjabrin, the division sign and the plus/minus sign (the
latter becomes a radical sign in 866). 252 is the Number symbol, 253 is
a sunburst, and 254 is "end of proof". 255 is in principle unused -- in
practice people put things there.

For the academic community, the lack of accents is remedied by the
Academic version of AV developed at Cornell, which includes upper and
lower case acute-accented vowels, and lower case grave-accented vowels.
These replace all but six of the graphics characters (the six that were
retained are those that are necessary for drawing a single-line box).
The accented vowels in this set include a grave-accented lower case yo.
Also included are the letters with diacritics used in French, German,
and Spanish. The complete chart and DOS/Windows software may be
requested from Exceller Software Corp. 800-426-0444. (This is NOT a
product endorsement -- I haven't even seen the stuff!)

Cornell also developed an Academic version of CP1251. In this,
non-Russian Slavic languages are not supported: their letters have been
replaced by Russian accented vowels.
These include upper and lower case acute-accented vowels, and lower
case grave-accented vowels. Also included are upper and lower case
grave-accented yo.

The AcademicFont Cyrillic character set was developed by University
Microcomputers, who pioneered the use of Slavic languages on
IBM-compatible computers in the US in the mid-eighties. This set is
included among the 11 sets in Exceller's product. It supports Slavic
and some non-Slavic languages, but not accented vowels.

For the Macintosh community, there is a separate code page. It is ascii
below 128, has the Russian capital letters in 128-159 in alphabetical
order (as usual, io is treated separately) and the Russian lowercase
letters in 224-254, but lower case ja is moved to 223, its place (255)
taken by the sunburst symbol. In the 160-222 range we find the same set
of (ISO 8859-5) non-Russian Cyrillic characters as in CP 1251. The
symbols that appear here are also largely the same as in 1251, but the
orderings are completely different and a few symbols are unique to one
or the other, e.g. permille in 1251, capital delta in the Mac encoding.

While a Macintosh version capable of character conversion is still on
the drawing boards, for most other platforms Columbia Kermit is capable
of converting between a large variety of Cyrillic encodings. Anon ftp
to watsun.cc.columbia.edu: for C-Kermit 5A(188) (Unix, VMS, OS/2, Amiga
etc) get file kermit/b/ckaaaa.hlp, read it, take it from there. For
MS-DOS Kermit 3.11, get (in binary mode) kermit/bin/msvibm.zip, then
unzip. For IBM Mainframe Kermit 4.2 and later, get kermit/b/ik0*.* plus
one of the following: kermit/b/ikc*.* for VM/CMS, kermit/b/ikt*.* for
MVS, kermit/b/ikx*.* for CICS or kermit/b/ikm*.* for MUSIC. There is
also a large collection of character-set tables under kermit/charsets.
Finally, the most broadly accepted standard outside these communities
seems to be GOSTSCI (GOSTCII), a term used colloquially to refer to
Brjabrin's Osnovnoj Variant or to ISO 8859-5 (which is also ECMA 113),
although these two are not identical when it comes to non-Russian
Cyrillic. The term "New KOI-8" means the 1987 revision of KOI-8 (GOST
19768-87) -- all these use the same (alphabetical, except for yo) order
as 8859-5, starting with A at 176. However, the non-Russian Cyrillic
characters (160-175 and 240-255 in new KOI-8) are not part of OV; their
space is taken up by some of the graphics chars described for AV above.
ISO 8859-5 provides for the Cyrillic characters required for writing
all major Slavic Cyrillic alphabets (Belorussian, Bulgarian,
Macedonian, Serbian, Ukrainian...), but not for those alphabets that
were devised for non-Slavic languages in the Soviet Union (Abkhazian,
Bashkir, Chukchee, Khanty, Tajik, ....), or archaic letters.

Q: Is this a big mess or what?

A: To straighten this out, it seems necessary to adopt a fixed point of
reference, which I take to be Unicode V1.1 = ISO 10646-1.2. While in
principle 10646 is a four-byte standard and Unicode uses 16-bit
integers, the "Basic Multilingual Plane" of 10646 is by definition
identical to the values assigned in Unicode 1.1, both being two-byte
quantities (called UCS-2 by ISO).

The following list gives the essential part of the names of the
Cyrillic characters and the last two hex digits of their Unicode/10646
encoding. For reasons of space, the official Unicode/10646 names have
been abbreviated. For a full list of names, anon ftp to unicode.org, cd
to pub/MappingTables, and get namesall.lst (which is slightly over
200k). To get back the full official name from the abbreviations,
always add the prefix CYRILLIC, unless the position is UNUSED. Further,
expand CAP (SMA) to CAPITAL (SMALL). Finally, the word LETTER should be
added after CAP/SMA, unless it is THOUSANDS, LIGATURE, or COMBINING.
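
The expansion rule just stated is mechanical enough to sketch in C. The
function name expand_name and its buffer interface are assumptions made
for this illustration; the code applies the rule literally as worded
above and does not check the result against namesall.lst.

```c
#include <stdio.h>
#include <string.h>

/* Expand an abbreviated name such as "CAP IO" into out (at most
   outlen bytes, always NUL-terminated). Returns out. */
char *expand_name(const char *abbr, char *out, size_t outlen)
{
    const char *kase = NULL;    /* "CAPITAL", "SMALL", or none */
    const char *rest = abbr;

    /* UNUSED positions get no prefix at all. */
    if (strcmp(abbr, "UNUSED") == 0) {
        snprintf(out, outlen, "%s", abbr);
        return out;
    }

    /* Expand CAP (SMA) to CAPITAL (SMALL). */
    if (strncmp(abbr, "CAP ", 4) == 0) {
        kase = "CAPITAL";
        rest = abbr + 4;
    } else if (strncmp(abbr, "SMA ", 4) == 0) {
        kase = "SMALL";
        rest = abbr + 4;
    }

    if (kase == NULL)
        snprintf(out, outlen, "CYRILLIC %s", rest);
    else if (strncmp(rest, "THOUSANDS", 9) == 0 ||
             strncmp(rest, "LIGATURE", 8) == 0 ||
             strncmp(rest, "COMBINING", 9) == 0)
        snprintf(out, outlen, "CYRILLIC %s %s", kase, rest);
    else
        snprintf(out, outlen, "CYRILLIC %s LETTER %s", kase, rest);
    return out;
}
```

With a 64-byte buffer, "CAP IO" expands to "CYRILLIC CAPITAL LETTER IO"
and "SMA LIGATURE A IE" to "CYRILLIC SMALL LIGATURE A IE".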
The numerical code values given in the second column have also been
abbreviated to the last two digits, since the preceding two hex digits
(really signifying "Cyrillic") are always 04 in Unicode/10646. The
third column gives the two-character mnemonic abbreviations suggested
in Keld Simonsen's RFC1345 where they exist, to facilitate
cross-reference to this document (available by anon ftp e.g. from
sunsite.unc.edu as /pub/doc/rfp/rfp1345.txt.Z) which has tables for
Serbian, Macedonian, as well as other Cyrillic encodings (IBM CP 880,
INIS-cyrillic = ISO-IR-51, ECMA-cyrillic = ISO-IR-111) whose domain of
usage is unclear to me, and whose table for Old KOI seems to be in fact
a New KOI table. I will add conversion tables for these (or for any
other) encodings provided a real user community exists and actually
generates some public domain machine-readable texts.

UNUSED                          00
CAP IO                          01  IO
CAP DJE                         02  D%
CAP GJE                         03  G%
CAP E                           04  IE
CAP DZE                         05  DS
CAP I                           06  II
CAP YI                          07  YI
CAP JE                          08  J%
CAP LJE                         09  LJ
CAP NJE                         0A  NJ
CAP TSHE                        0B  Ts
CAP KJE                         0C  KJ
UNUSED                          0D
CAP SHORT U                     0E  V%
CAP DZHE                        0F  DZ
CAP A                           10  A=
CAP BE                          11  B=
CAP VE                          12  V=
CAP GE                          13  G=
CAP DE                          14  D=
CAP IE                          15  E=
CAP ZHE                         16  Z%
CAP ZE                          17  Z=
CAP II                          18  I=
CAP SHORT II                    19  J=
CAP KA                          1A  K=
CAP EL                          1B  L=
CAP EM                          1C  M=
CAP EN                          1D  N=
CAP O                           1E  O=
CAP PE                          1F  P=
CAP ER                          20  R=
CAP ES                          21  S=
CAP TE                          22  T=
CAP U                           23  U=
CAP EF                          24  F=
CAP KHA                         25  H=
CAP TSE                         26  C=
CAP CHE                         27  C%
CAP SHA                         28  S%
CAP SHCHA                       29  Sc
CAP HARD SIGN                   2A  ="
CAP YERI                        2B  Y=
CAP SOFT SIGN                   2C  %"
CAP REVERSED E                  2D  JE
CAP IU                          2E  JU
CAP IA                          2F  JA
SMA A                           30  a=
SMA BE                          31  b=
SMA VE                          32  v=
SMA GE                          33  g=
SMA DE                          34  d=
SMA IE                          35  e=
SMA ZHE                         36  z%
SMA ZE                          37  z=
SMA II                          38  i=
SMA SHORT II                    39  j=
SMA KA                          3A  k=
SMA EL                          3B  l=
SMA EM                          3C  m=
SMA EN                          3D  n=
SMA O                           3E  o=
SMA PE                          3F  p=
SMA ER                          40  r=
SMA ES                          41  s=
SMA TE                          42  t=
SMA U                           43  u=
SMA EF                          44  f=
SMA KHA                         45  h=
SMA TSE                         46  c=
SMA CHE                         47  c%
SMA SHA                         48  s%
SMA SHCHA                       49  sc
SMA HARD SIGN                   4A  ='
SMA YERI                        4B  y=
SMA SOFT SIGN                   4C  %'
SMA REVERSED E                  4D  je
SMA IU                          4E  ju
SMA IA                          4F  ja
UNUSED                          50
SMA IO                          51  io
SMA DJE                         52  d%
SMA GJE                         53  g%
SMA E                           54  ie
SMA DZE                         55  ds
SMA I                           56  ii
SMA YI                          57  yi
SMA JE                          58  j%
SMA LJE                         59  lj
SMA NJE                         5A  nj
SMA TSHE                        5B  ts
SMA KJE                         5C  kj
UNUSED                          5D
SMA SHORT U                     5E  v%
SMA DZHE                        5F  dz
CAP OMEGA                       60
SMA OMEGA                       61
CAP YAT                         62  Y3
SMA YAT                         63  y3
CAP IOTIFIED E                  64
SMA IOTIFIED E                  65
CAP LITTLE YUS                  66
SMA LITTLE YUS                  67
CAP IOTIFIED LITTLE YUS         68
SMA IOTIFIED LITTLE YUS         69
CAP BIG YUS                     6A  O3
SMA BIG YUS                     6B  o3
CAP IOTIFIED BIG YUS            6C
SMA IOTIFIED BIG YUS            6D
CAP KSI                         6E
SMA KSI                         6F
CAP PSI                         70
SMA PSI                         71
CAP FITA                        72  F3
SMA FITA                        73  f3
CAP IZHITSA                     74  V3
SMA IZHITSA                     75  v3
CAP IZHITSA DOUBLE GRAVE        76
SMA IZHITSA DOUBLE GRAVE        77
CAP UK DIGRAPH                  78
SMA UK DIGRAPH                  79
CAP ROUND OMEGA                 7A
SMA ROUND OMEGA                 7B
CAP OMEGA TITLO                 7C
SMA OMEGA TITLO                 7D
CAP OT                          7E
SMA OT                          7F
CAP KOPPA                       80  C3
SMA KOPPA                       81  c3
THOUSANDS SIGN                  82
NON-SPACING TITLO               83
NON-SPACING PALATALIZATION      84
NON-SPACING DASIA PNEUMATA      85
NON-SPACING PSILI PNEUMATA      86
UNUSED                          87
UNUSED                          88
UNUSED                          89
UNUSED                          8A
UNUSED                          8B
UNUSED                          8C
UNUSED                          8D
UNUSED                          8E
UNUSED                          8F
CAP GE WITH UPTURN              90  G3
SMA GE WITH UPTURN              91  g3
CAP GE BAR                      92
SMA GE BAR                      93
CAP GE HOOK                     94
SMA GE HOOK                     95
CAP ZHE WITH RIGHT DESCENDER    96
SMA ZHE WITH RIGHT DESCENDER    97
CAP ZE CEDILLA                  98
SMA ZE CEDILLA                  99
CAP KA WITH RIGHT DESCENDER     9A
SMA KA WITH RIGHT DESCENDER     9B
CAP KA VERTICAL BAR             9C
SMA KA VERTICAL BAR             9D
CAP KA BAR                      9E
SMA KA BAR                      9F
CAP REVERSED GE KA              A0
SMA REVERSED GE KA              A1
CAP EN WITH RIGHT DESCENDER     A2
SMA EN WITH RIGHT DESCENDER     A3
CAP EN GE                       A4
SMA EN GE                       A5
CAP PE HOOK                     A6
SMA PE HOOK                     A7
CAP O HOOK                      A8
SMA O HOOK                      A9
CAP ES CEDILLA                  AA
SMA ES CEDILLA                  AB
CAP TE WITH RIGHT DESCENDER     AC
SMA TE WITH RIGHT DESCENDER     AD
CAP STRAIGHT U                  AE
SMA STRAIGHT U                  AF
CAP STRAIGHT U BAR              B0
SMA STRAIGHT U BAR              B1
CAP KHA WITH RIGHT DESCENDER    B2
SMA KHA WITH RIGHT DESCENDER    B3
CAP TE TSE                      B4
SMA TE TSE                      B5
CAP CHE WITH RIGHT DESCENDER    B6
SMA CHE WITH RIGHT DESCENDER    B7
CAP CHE VERTICAL BAR            B8
SMA CHE VERTICAL BAR            B9
CAP H                           BA
SMA H                           BB
CAP IE HOOK                     BC
SMA IE HOOK                     BD
CAP IE HOOK OGONEK              BE
SMA IE HOOK OGONEK              BF
PALOCHKA                        C0
CAP SHORT ZHE                   C1
SMA SHORT ZHE                   C2
CAP KA HOOK                     C3
SMA KA HOOK                     C4
UNUSED                          C5
UNUSED                          C6
CAP EN HOOK                     C7
SMA EN HOOK                     C8
UNUSED                          C9
UNUSED                          CA
CAP CHE WITH LEFT DESCENDER     CB
SMA CHE WITH LEFT DESCENDER     CC
UNUSED                          CD
UNUSED                          CE
UNUSED                          CF
CAP A WITH BREVE                D0
SMA A WITH BREVE                D1
CAP A WITH DIAERESIS            D2
SMA A WITH DIAERESIS            D3
CAP LIGATURE A IE               D4
SMA LIGATURE A IE               D5
CAP IE WITH BREVE               D6
SMA IE WITH BREVE               D7
CAP SCHWA                       D8
SMA SCHWA                       D9
CAP SCHWA WITH DIAERESIS        DA
SMA SCHWA WITH DIAERESIS        DB
CAP ZHE WITH DIAERESIS          DC
SMA ZHE WITH DIAERESIS          DD
CAP ZE WITH DIAERESIS           DE
SMA ZE WITH DIAERESIS           DF
CAP ABKHASIAN DZE               E0
SMA ABKHASIAN DZE               E1
CAP I WITH MACRON               E2
SMA I WITH MACRON               E3
CAP I WITH DIAERESIS            E4
SMA I WITH DIAERESIS            E5
CAP O WITH DIAERESIS            E6
SMA O WITH DIAERESIS            E7
CAP BARRED O                    E8
SMA BARRED O                    E9
CAP BARRED O WITH DIAERESIS     EA
SMA BARRED O WITH DIAERESIS     EB
CAP U WITH ACUTE                EC
SMA U WITH ACUTE                ED
CAP U WITH MACRON               EE
SMA U WITH MACRON               EF
CAP U WITH DIAERESIS            F0
SMA U WITH DIAERESIS            F1
CAP U WITH DOUBLE ACUTE         F2
SMA U WITH DOUBLE ACUTE         F3
CAP CHE WITH DIAERESIS          F4
SMA CHE WITH DIAERESIS          F5
CAP DJE WITH ACUTE              F6
SMA DJE WITH ACUTE              F7
CAP YERU WITH DIAERESIS         F8
SMA YERU WITH DIAERESIS         F9
UNUSED                          FA
UNUSED                          FB
UNUSED                          FC
UNUSED                          FD
UNUSED                          FE
UNUSED                          FF

Q: Is everything clear now?

A: Probably not. To ease the pain, here follow some tentative
conversion tables *from* the 8-bit schemes described above *to*
Unicode. Since the Unicode/10646 character set is much larger, no
tables are provided in the other direction. In the 0-127 range
everything is ASCII (except for the CP866 dingbats in the range 0-31
which are at any rate optional, and for EBCDIC/DKOI-8, for which see
above) so here tables are only provided for 128-255.
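
As a rough illustration of how a 128-entry table of this kind is meant
to be used, a lookup could be written as below. This is a sketch for
the ASCII-compatible encodings only (not EBCDIC/DKOI-8, and ignoring
the CP866 dingbats); the names byte_to_unicode and is_mapped are made
up here, and negative results stand for the -1 and -2 markers explained
in the text that follows.

```c
/* Look up one byte of 8-bit Cyrillic text in a 128-entry table that
   covers codes 128-255. Codes 0-127 are returned unchanged. */
long byte_to_unicode(const long table[128], unsigned char b)
{
    if (b < 128)
        return b;               /* ascii range: same value in Unicode */
    return table[b - 128];      /* may be negative: no usable mapping */
}

/* True if the lookup produced a real code point rather than a marker. */
int is_mapped(long u)
{
    return u >= 0;
}
```

A caller would typically test is_mapped() on each result and decide
locally what to emit for unmapped bytes.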
Notice that values not starting with 0x04 are often given, meaning that
the Unicode equivalent is outside the Unicode Cyrillic range
0x0400-0x04ff, but included at some other place, typically among the
arrows (0x2190-0x21ff) or other semigraphic material (0x2500-0x25ff).
If a particular encoding leaves (by official definition, not
necessarily in practical usage) some code unused, this is designated by
"-1" in the conversion table. For some positions the tables show a
"-2", meaning that I have no information on the intended meaning. (This
is not the same as there being no Unicode codepoint for the character
in question, a situation we potentially encounter with AV and OV
242-245, see note there.)

/* From old Koi-8 to Unicode */
long oldkoi8tou[128] = {
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    0x044e,0x0430,0x0431,0x0446,0x0434,0x0435,0x0444,0x0433,
    0x0445,0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,
    0x043f,0x044f,0x0440,0x0441,0x0442,0x0443,0x0436,0x0432,
    0x044c,0x044b,0x0437,0x0448,0x044d,0x0449,0x0447,0x044a,
    0x042e,0x0410,0x0411,0x0426,0x0414,0x0415,0x0424,0x0413,
    0x0425,0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,
    0x041f,0x042f,0x0420,0x0421,0x0422,0x0423,0x0416,0x0412,
    0x042c,0x042b,0x0417,0x0428,0x042d,0x0429,0x0427,0x042a
};

/* From CP866 to Unicode */
long cp866tou[128] = {
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
    0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
    0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
    0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
    0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
    0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
    0x0401,0x0451,0x0404,0x0454,0x0407,0x0457,0x040e,0x045e,
    0x00b0,0x2022,0x00b7,0x221a,0x2116,0x00a4,0x25a0, -1
};

/* From CP1251 to Unicode */
long cp1251tou[128] = {
    0x0402,0x0403,0x201a,0x0453,0x201e,0x2026,0x2020,0x2021,
    -1,0x2030,0x0409,0x2039,0x040a,0x040c,0x040b,0x040f,
    0x0452,0x2018,0x2019,0x201c,0x201d,0x2022,0x2013,0x2014,
    -1,0x2122,0x0459,0x203a,0x045a,0x045c,0x045b,0x045f,
    0x00a0,0x040e,0x045e,0x0408,0x00a4,0x0490,0x00a6,0x00a7,
    0x0401,0x00a9,0x0404,0x00ab,0x00ac,0x00ad,0x00ae,0x0407,
    0x00b0,0x00b1,0x0406,0x0456,0x0491,0x00b5,0x00b6,0x00b7,
    0x0451,0x2116,0x0454,0x00bb,0x0458,0x0405,0x0455,0x0457,
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
};

/* From Mac to Unicode */
long mactou[128] = {
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x2020,0x00b0,0x0490,0x00a3,0x00a7,0x2022,0x00b6,0x0406,
    0x00ae,0x00a9,0x2122,0x0402,0x0452,0x2260,0x0403,0x0453,
    0x221e,0x00b1,0x2264,0x2265,0x0456,0x03bc,0x0491,0x0408,
    0x0404,0x0454,0x0407,0x0457,0x0409,0x0459,0x040a,0x045a,
    0x0458,0x0405,0x00ac,0x221a,0x0192,0x2248,0x0394,0x00ab,
    0x00bb,0x2026,0x0020,0x040b,0x045b,0x040c,0x045c,0x0455,
    0x00b0,0x00b1,0x0406,0x0456,0x0491,0x00b5,0x00b6,0x00b7,
    0x040e,0x045e,0x040f,0x045f,0x2116,0x0401,0x0451,0x044f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x00a4,
};

/* From Alternativnyj Variant to Unicode */
long avtou[128] = {
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
    0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
    0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
    0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
    0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
    0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
    0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
    0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
};

/* The interpretation of the four symbols following the second
   alphabetic block in AV remains unclear. One suggestion was to treat
   these as (non-spacing) grave and acute, as appearing above upper- or
   lowercase letters, but the graphical rendering in Brjabrin's
   original article makes clear that the distinction is between acute
   and grave, above or below the letter: this is what the table now
   has.
   But the preponderance of graphical symbols in AV suggests that the
   intention was to provide facilities for character graphics, in which
   case the interpretation is simply straight lines connecting two
   adjacent midpoints of the bounding box. If the box is the unit
   square, these would run from (.5,0) to (0,.5) and to (1,.5), and
   from (.5,1) to (0,.5) and to (1,.5), in this order. (The line
   segments are of course directionless.) Such symbols are not present
   in Unicode -- the closest things are 0x25de 0x25df 0x25dc 0x25dd (in
   this order) but these are curved, not straight. Whether the graphics
   or the accent usage is more prevalent in actual usage only those
   plugged into the Russian PC community can tell. If the graphics
   usage turns out to be prevalent, these four symbols would be
   reasonable candidates for incorporation into Unicode, perhaps at
   positions 0x25ef to 0x25f3. */

/* From Osnovnoj Variant to Unicode */
long ovtou[128] = {
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    -2, -2, -2, -2, -2, -2, -2, -2,
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
    0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
    0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
};

/* The same problem with the interpretation of 242-245 as in AV (these
   rows are definitely identical). The low positions of OV are probably
   identical to 176-223 in AV...
*/

/* From ISO8859-5 to Unicode */
long newkoi8tou[128] = {
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,
    0x00a0,0x0401,0x0402,0x0403,0x0404,0x0405,0x0406,0x0407,
    0x0408,0x0409,0x040a,0x040b,0x040c,0x00ad,0x040e,0x040f,
    0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
    0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
    0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
    0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
    0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
    0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
    0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
    0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
    0x2116,0x0451,0x0452,0x0453,0x0454,0x0455,0x0456,0x0457,
    0x0458,0x0459,0x045a,0x045b,0x045c,0x00a7,0x045e,0x045f
};

/* Use newkoi8tou in combination with isotoibm to derive the unicode
   meaning of the Cyrillic range in the DKOI extension of EBCDIC. If
   someone has DKOI-8 text available, I'd love to actually try... */
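
The combination suggested in the closing comment can be sketched as
follows. This is untested against real DKOI-8 text and rests on two
assumptions: that the 96-entry ISO-to-EBCDIC table has no duplicate
values, so it can be inverted, and that the tables are passed in as
parameters. The names invert_isotoibm, dkoi_to_unicode and NOT_IN_TABLE
are invented for this illustration.

```c
/* Hypothetical marker for EBCDIC codes that the inverted table does
   not cover; unrelated to the -1 and -2 markers of the tables. */
#define NOT_IN_TABLE (-3L)

/* Build the inverse of an ISO 8859-5 -> EBCDIC table of the form
   ebcdic(isoval) = isotoibm[isoval-160]. Unfilled slots are -1. */
void invert_isotoibm(const int isotoibm[96], int ibmtoiso[256])
{
    int i;

    for (i = 0; i < 256; i++)
        ibmtoiso[i] = -1;
    for (i = 0; i < 96; i++)
        ibmtoiso[isotoibm[i] & 0xff] = 160 + i;
}

/* Map one EBCDIC/DKOI-8 code to Unicode by way of ISO 8859-5, where
   isotou covers ISO codes 128-255 (as newkoi8tou does). */
long dkoi_to_unicode(int ebcdic, const int ibmtoiso[256],
                     const long isotou[128])
{
    int iso = ibmtoiso[ebcdic & 0xff];

    if (iso < 0)
        return NOT_IN_TABLE;
    return isotou[iso - 128];
}
```

With the tables of this document, EBCDIC 0x41 would first go to ISO 160
(the first isotoibm entry) and from there to 0x00a0 via newkoi8tou.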